Abstract

Networks can be considered as approximation schemes. Multilayer networks of the 
backpropagation type can approximate arbitrarily well continuous functions (Cybenko, 
1989; Funahashi, 1989; Stinchcombe and White, 1989). We prove that networks de- 
rived from regularization theory and including Radial Basis Functions (Poggio and 
Girosi, 1989), have a similar property. From the point of view of approximation the- 
ory, however, the property of approximating continuous functions arbitrarily well is not 
sufficient for characterizing good approximation schemes. More critical is the property 
of best approximation. The main result of this paper is that multilayer networks, of the
type used in backpropagation, do not have the best approximation property. For regularization
networks (in particular Radial Basis Function networks) we prove existence and uniqueness of
best approximation.
1 Introduction

Learning an input-output relation from examples can be considered as the problem of ap- 
proximating an unknown function f(x) from a set of sparse data points (Poggio and Girosi, 
1989). From this point of view, feedforward networks are equivalent to a parametric approx- 
imating function F(W, x). As an example, consider a feedforward network, of the multilayer 
perceptron type, with one hidden layer; the vector W corresponds, then, to the two sets 
of "weights," from the input to the hidden layer, and from the hidden layer to the output. 
Even before considering the problem of how to find the appropriate values of W for the 
set of data, the fundamental representational problem must be approached: which class of 
mappings f can be approximated by F, and how well? The neural network field has recently
seen an increasing awareness of this problem. Several results have been published,
all showing that multilayer perceptrons of different form and complexity can approximate 
arbitrarily well a continuous function, provided that an arbitrarily large number of units 
is available (Cybenko, 1989; Funahashi, 1989; Moore and Poggio, 1988; Stinchcombe and 
White, 1989; Carroll and Dickinson, 1989). This property is shared by algebraic and
trigonometric polynomials, as is shown by the classical Weierstrass Theorem, and for this reason
we shall refer to it as the Weierstrass property. Results of this type should not be taken to
mean that the approximation scheme is a "good" approximation scheme. An indication of
the latter point is provided, in the case of multilayer perceptron networks, of the type used 
for backpropagation, by a closer look at the published results. Taken together, they imply 
that almost any nonlinearity at the hidden layer level and a variety of different architectures 
(one or more hidden layers, for instance) ensure the Weierstrass property (Funahashi, 1989;
Cybenko, 1989; Stinchcombe and White, 1989). There is nothing special about sigmoids, 
and in fact many classical approximation schemes exist that can be represented as a network 
with a hidden layer and that exhibit the Weierstrass property. In a sense this property is 
not very useful for characterizing approximation schemes, since many schemes have it. The
literature in the field of approximation theory reflects this situation, since it emphasizes other
properties in characterizing approximation schemes. In particular, a critical concept is that
of best approximation. An approximation scheme has the best approximation property if in
the set A of approximating functions (for instance the set F(W, x) spanned by parameters
W) there is one that has minimum distance from any given function of a larger set $\Phi$ (a
more formal definition is given later). Several questions can be asked, such as the existence,
uniqueness, computability, etc., of the best approximation.

In this paper, we show that feedforward multilayer networks of the backpropagation 
type (Rumelhart et al., 1986, 1986a; Sejnowski and Rosenberg, 1987) do not have the best 
approximation property for the class of continuous functions defined on a subset of $R^n$. On
the other hand, we prove that for networks derived from regularization, and in particular for
radial basis function networks, best approximation exists and is unique. We also prove that
these networks approximate arbitrarily well continuous functions (see Appendices B and C).
We have recently shown that radial basis function approximation schemes can be derived
from regularization and are therefore equivalent to generalized (radial) splines (Poggio and
Girosi, 1989). For Radial Basis Function networks we prove existence and uniqueness of best
approximation.^1

The plan of the paper is as follows. We first formalize the previous arguments, then
introduce some basic notions from approximation theory. Next, we prove that multilayer
networks of the type used for backpropagation do not have the best approximation property,
and that networks obtained from regularization theory have this property. In the last section,
we discuss the implications of these results and list some open questions. Appendix B proves
that the Stone-Weierstrass theorem holds for Gaussian Radial Basis Function networks (with
different variances). In Appendix C we prove a more general result: regularization networks
approximate arbitrarily well any continuous function on a compact subset of $R^n$.

2 Most networks approximate continuous functions 

In recent years there have been attempts to find a mathematical justification for the use of 
feedforward multilayer networks of the type used for backpropagation. Typical results deal 
with the possibility, given a network, of approximating any continuous function arbitrarily 
well. In mathematical terms this means that the set of functions that can be computed 
by the network is dense (see Appendix A) in the space of the continuous functions C[U]
defined on some subset U of $R^d$. The most recent results (Cybenko, 1989; Funahashi, 1989;
Stinchcombe and White, 1989) consider networks with just one layer of hidden units, which
correspond to the following class of approximating functions:



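$$\Sigma = \Big\{ f \in C[U] \;\Big|\; f(\mathbf{x}) = \sum_{i=1}^{m} c_i\, \sigma(\mathbf{x} \cdot \mathbf{w}_i + \theta_i),\ U \subseteq R^d,\ \mathbf{w}_i \in R^d,\ c_i, \theta_i \in R,\ m \in N \Big\} \tag{1}$$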

where $\sigma$ is a continuous function. Depending on $\sigma$, the set $\Sigma$ may or may not be dense in
the space of the continuous functions. The set $\mathcal{D}$ of functions $\sigma$ such that $\Sigma$ is dense seems
to be large. For instance, the sigmoidal functions, that is functions such that

$$\lim_{t \to +\infty} \sigma(t) = 1, \qquad \lim_{t \to -\infty} \sigma(t) = 0,$$

belong to $\mathcal{D}$ (Cybenko, 1989; Funahashi, 1989). Many other types of functions in $\mathcal{D}$ can be
found in the paper of Cybenko (1989). The set $\mathcal{D}$ has been recently extended by the result
of Stinchcombe and White (1989). In fact they prove that it contains all the functions whose
mean value is different from zero and whose $L^p$-norm is finite for $1 < p < \infty$.
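
As a concrete reading of equation 1, the following minimal sketch evaluates one member of the class of one-hidden-layer networks for a given set of parameters. The function names `sigmoid` and `network`, the choice of the logistic function as the sigmoid, and all numeric values are illustrative assumptions, not taken from the paper.

```python
import math

def sigmoid(t):
    """Logistic function: tends to 1 as t -> +inf and to 0 as t -> -inf."""
    return 1.0 / (1.0 + math.exp(-t))

def network(x, weights, thetas, coeffs, sigma=sigmoid):
    """Evaluate f(x) = sum_i c_i * sigma(x . w_i + theta_i) for one input x."""
    total = 0.0
    for w, theta, c in zip(weights, thetas, coeffs):
        dot = sum(xj * wj for xj, wj in zip(x, w))
        total += c * sigma(dot + theta)
    return total

# Hypothetical network with m = 2 hidden units acting on inputs in R^2.
weights = [(1.0, 0.0), (0.0, -2.0)]
thetas = [0.0, 1.0]
coeffs = [0.5, -1.5]
value = network((0.0, 0.5), weights, thetas, coeffs)
```

Here the vector W of the introduction corresponds to the three lists `weights`, `thetas` and `coeffs` taken together.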

Other networks can be built, such that the corresponding set of approximating functions is
dense in C[U]. Consider for example the network in figure 1. This is the most general network
with one layer of hidden units, and the class of approximating functions corresponding to it
is



^1 The theory has been extended by introducing the more general schemes of GRBF and HyperBF, which
can be considered as the network equivalent of generalized multidimensional splines with free knots.




Figure 1: The most general network with one layer of hidden units. Here we show the two-
dimensional case, in which $\mathbf{x} = (x, y)$. Each function $H_i$ can depend on a set of unknown
parameters, that are computed during the learning phase, as well as the coefficients $c_i$. When
$H_i = \sigma(\mathbf{x} \cdot \mathbf{w}_i + \theta_i)$ a network of the backpropagation type is recovered, while
$H_i = H(\|\mathbf{x} - \mathbf{t}_i\|)$ corresponds to the RBF or GRBF scheme (Broomhead and Lowe, 1988;
Poggio and Girosi, 1989).



$$\mathcal{N} = \Big\{ f \in C[U] \;\Big|\; f(\mathbf{x}) = \sum_{i=1}^{m} c_i\, H_i(\mathbf{x}),\ U \subseteq R^d,\ H_i \in C[U],\ m \in N \Big\}. \tag{2}$$



The functions $H_i$ are of the form $H_i = H(\mathbf{x}; \mathbf{w}_i)$, where $\mathbf{w}_i$ is a vector of unknown
parameters in some multidimensional space and H is a continuous function. If the $H_i$ are
appropriately chosen the set $\mathcal{N}$ can be dense in C[U]. For example the $H_i$ could be
algebraic or trigonometric polynomials, and in this case the denseness of $\mathcal{N}$ would be a trivial
consequence of the Stone-Weierstrass theorem (see Appendix B). This theorem allows a
significant extension of the set of "basis" functions $H_i$. Appendix B gives another example,
showing how Gaussian functions of radial argument (and different variances) can be used to
approximate any continuous function. Appendix C provides a more powerful result showing
that all networks derived from regularization theory can approximate arbitrarily well
continuous functions on a compact subset of $R^n$. This result includes, in particular, Radial Basis
Function networks in which the radial basis function is the Green's function of a self-adjoint
differential operator associated with the Tikhonov stabilizer. Such Green's functions include
most of the known approximation schemes, such as the Gaussian and several types of splines,
as well as many (but not all) of the functions that satisfy the sufficient conditions given by
Micchelli (1986) for interpolating functions.



Since a large number of networks can approximate arbitrarily well any continuous
function, it is natural to ask whether this property is really important from the point of view of
approximation theory, and whether other more fundamental properties can be characterized.
As we mentioned already, one of the basic properties that an approximating set should have
is the best approximation property, which guarantees that the approximation problem has a
solution. The next section focuses on the relationship between this property
and different kinds of networks, since this seems to be a more appropriate starting point for a
complete analysis of network performance from a rigorous mathematical point of view.

3 Basic facts in approximation theory 
3.1 The best approximation property 

An informal formulation of the approximation problem can be stated as follows: given a
function f belonging to some prescribed set of functions $\Phi$, and given a subset A of $\Phi$, find
the element a of A that is the "closest" to f.

In order to give this formulation a precise mathematical meaning, some definitions are
needed. First of all a notion of "distance" has to be introduced on the set $\Phi$. Since this set
is usually assumed to be a normed linear space, with norm indicated by $\|\cdot\|$, the distance
$d(f, g)$ between two elements f and g of $\Phi$ is naturally defined as $\|f - g\|$. Given $f \in \Phi$ and
$A \subseteq \Phi$ we can now define the distance of f from A as

$$d(f, A) = \inf_{a \in A} \|f - a\|. \tag{3}$$

If the infimum of $\|f - a\|$ is attained for some element $a_0$ of A, that is if there exists an
$a_0 \in A$ such that $\|f - a_0\| = d(f, A)$, this element is said to be a best approximation to f
from A. A set A is called an existence set (uniqueness set, resp.) if, to each $f \in \Phi$, there is at
least (at most, resp.) one best approximation to f from A. If the set A is an existence set we
will also say that it has the best approximation property. A set A is called a Tchebycheff set
if it is an existence set and a uniqueness set. We are now ready to give a precise formulation
of the approximation problem:

Approximation problem: given $f \in \Phi$ and $A \subseteq \Phi$ find a best approximation to f
from A.

From the definition above it is clear that the approximation problem has a solution if
and only if A is an existence set, and a large part of approximation theory has been devoted
to proving existence theorems, which give sufficient conditions to guarantee existence and
possibly uniqueness of closest points. We will only present very simple properties of sets
with the best approximation property, and will apply these results to network architectures,
in order to understand their properties from the point of view of approximation theory.

We begin with the following observation: 

Proposition 3.1 Every existence set is closed. 



Proof. Let $A \subseteq \Phi$ be an existence set, and suppose that it is not closed. Then there is a
sequence $\{a_n\}$ of elements of A that converges to an element f that is not in A, that is there
exists an $f \in \Phi \setminus A$ such that

$$\lim_{n \to \infty} d(f, a_n) = 0.$$

This means that d(f, A) = 0, and since A is an existence set there is an element $a_0 \in A$ such
that $\|f - a_0\| = 0$. By the properties of the norm this implies that $f = a_0$, which is absurd
because $f \notin A$ and $a_0 \in A$. Then A must be closed. □
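
The argument can be illustrated with a toy case on the real line. The sketch below assumes that A is the open interval (0, 1) and that f = 1, a limit point of A lying outside it; the names `distance_to_set`, `sequence` and `gaps` are hypothetical. The distances shrink toward d(f, A) = 0, yet no sampled element of A attains it.

```python
def distance_to_set(f, candidates):
    """Smallest |f - a| over a finite list of candidates taken from the set A."""
    return min(abs(f - a) for a in candidates)

# A is the open interval (0, 1); f = 1 is a limit point of A that is not in A.
f = 1.0
sequence = [1.0 - 1.0 / n for n in range(2, 10001)]   # a_n in A with a_n -> f
gaps = [abs(f - a) for a in sequence]

# The gaps decrease toward d(f, A) = 0, but every element of A leaves a positive gap.
smallest_gap = distance_to_set(f, sequence)
all_inside = all(0.0 < a < 1.0 for a in sequence)
```

Adding the endpoint 1 to A (that is, passing to the closure) is what would make the infimum attainable in this example.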

The converse of this proposition is not true, that is closedness is not sufficient for a set 
to be an existence set. However the stronger condition of compactness is sufficient, as the 
following theorem shows. 

Theorem 3.1 Let A be a compact set in a metric space $\Phi$. Then A is an existence set.

Proof. For each $f \in \Phi$ the distance d(f, a), with $a \in A$, is a continuous real valued function
defined on the compact set A. By Theorem A.2 of Appendix A it attains its maximum
and minimum value on this set, so the infimum d(f, A) is attained, and this concludes the proof. □

In the next section we apply these simple results to some network architectures. 

4 Networks and approximation theory 

From the point of view of approximation theory a feedforward network is a representation of
a set A of parametric functions, and the learning algorithm corresponds to the search for the
best approximation to some target function f from A. Since in general a best approximation
does not exist unless the set A has some properties (see, for instance, Theorem 3.1), it is of
interest to understand which classes of networks have these properties.

4.1 Multilayer networks of the backpropagation type do not have 
the best approximation property 

Here we consider the class of networks of the backpropagation type with one layer of hidden
units. The space $\Phi$ of functions that have to be approximated is chosen to be C[U], the
set of continuous functions defined on a subset U of $R^d$ with some unspecified norm. If the
number of hidden units is m, the functions that can be computed by such networks belong
to the following set $\sigma_m$:

$$\sigma_m = \Big\{ f \in C[U] \;\Big|\; f(\mathbf{x}) = \sum_{i=1}^{m} c_i\, \sigma(\mathbf{x} \cdot \mathbf{w}_i + \theta_i),\ \mathbf{w}_i \in R^d,\ c_i, \theta_i \in R \Big\} \tag{4}$$

where $\sigma(x)$ is usually a sigmoidal function. We now show that $\sigma_m$ is not an existence set,
and this does not depend on the norm that has been chosen. The result is proved for one
hidden layer in the case of $\sigma$ being the sigmoid $\sigma(x) = (1 + e^{-x})^{-1}$, but it holds for every
other non trivial choice of nonlinear function and for networks with more than one hidden
layer.



Proposition 4.1 The set $\sigma_m$ is not an existence set for $m \geq 2$.

Proof: A necessary condition for a set to be an existence set is to be closed. Therefore it
is sufficient to show that $\sigma_m$ is not closed, and this can be done by exhibiting an accumulation
point that does not belong to it. Let us consider the following function:

$$f_\delta(\mathbf{x}) = \frac{1}{\delta} \left( \frac{1}{1 + e^{-[\mathbf{w} \cdot \mathbf{x} + (\theta + \delta)]}} - \frac{1}{1 + e^{-[\mathbf{w} \cdot \mathbf{x} + \theta]}} \right)$$

Clearly $f_\delta \in \sigma_m, \forall m \geq 2$, but it is easily seen that

$$\lim_{\delta \to 0} f_\delta(\mathbf{x}) = g(\mathbf{x}) \equiv \frac{1}{2(1 + \cosh[\mathbf{w} \cdot \mathbf{x} + \theta])}$$

and $g \notin \sigma_m, \forall m \geq 2$. For each $m \geq 2$ the function g is then an accumulation point of $\sigma_m$
but does not belong to it: $\sigma_m$ cannot be closed and this concludes the proof. □
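
The limit used in the proof can be checked numerically. The sketch below is an illustration under stated assumptions: it takes the two-unit function to be the difference quotient of the logistic sigmoid, with the scalar u standing for w·x + θ and the two terms ordered so that the limit is positive; the names `f_delta`, `g` and `sup_errors` are hypothetical. Note that the output coefficients of the two units are +1/δ and −1/δ, which diverge as δ tends to 0.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def f_delta(u, delta):
    """Two-unit network output with coefficients +1/delta and -1/delta (u = w.x + theta)."""
    return (sigmoid(u + delta) - sigmoid(u)) / delta

def g(u):
    """Pointwise limit of f_delta as delta -> 0, i.e. the derivative of the sigmoid."""
    return 1.0 / (2.0 * (1.0 + math.cosh(u)))

# Largest deviation between f_delta and g on a grid of u values, for shrinking delta.
grid = [-5.0 + 0.1 * k for k in range(101)]
deltas = [1.0, 0.1, 0.01, 0.001]
sup_errors = [max(abs(f_delta(u, d) - g(u)) for u in grid) for d in deltas]
```

The entries of `sup_errors` shrink roughly in proportion to δ, which is the numerical counterpart of g being an accumulation point of the set of two-unit networks.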
This result reflects a general fact in nonlinear approximation theory: usually the set of
approximating functions is not closed, and its closure must be added to it in order to obtain
an existence set. This is the case, for instance, for the approximation by $\gamma$-polynomials in
one dimension, which are replaced by the extended $\gamma$-polynomials to guarantee the existence
of a best approximating element (Braess, 1986; Rice, 1964, 1969; Hobby and Rice, 1967; De
Boor, 1969).

4.2 Existence and uniqueness of best approximation for regularization and RBF

One of the possible approaches to the problem of surface reconstruction is given by
regularization theory (Tikhonov and Arsenin, 1977; Bertero et al., 1988). Poggio and Girosi (1989)
have shown that the solution obtained by means of this method maps into a class of networks 
with one hidden layer (an instance of which are Radial Basis Function networks or RBF). 
In fact the solution can always be written in the parametric form: 

$$f(\mathbf{x}) = \sum_{i=1}^{m} c_i\, \phi_i(\mathbf{x}) \tag{5}$$

where the $c_i$ are unknown, m is the number of data points and the $\phi_i$ are fixed, depending on
the nature of the problem and on the data points. More precisely the "basis function" $\phi_i$ is
of the form $\phi_i(\mathbf{x}) = G(\mathbf{x}; \mathbf{x}_i)$, where $\mathbf{x}_i$ is a data point and G is the Green's function of some
(pseudo)differential operator P (a term belonging to the null space of P can also appear, see
Appendix C). In the particular case of a radial function $G = G(\|\mathbf{x} - \mathbf{x}_i\|)$ the RBF method is
recovered, and the solution of the approximation problem is then a linear superposition of
radial Green's functions G "centered" on the data points.

Notice that this function can be computed by a network that is a special case of the
one represented in figure 1. The main difference is that in the general case the functions $G_i$
depend on unknown parameters, while in the regularization context only the coefficients $c_i$
are unknown.




Equation 5 means that the approximated solution belongs to the subset $\mathcal{T}_m$ of C[U]:

$$\mathcal{T}_m = \Big\{ f \in C[U] \;\Big|\; f(\mathbf{x}) = \sum_{i=1}^{m} c_i\, \phi_i(\mathbf{x}),\ c_i \in R \Big\} \tag{6}$$

Since we have shown that the set of approximating functions associated with networks with
one hidden layer of the type used for backpropagation does not have the best approximation
property, it is natural to ask whether or not the set $\mathcal{T}_m$ has this property.^2 The answer is
positive, as is stated in the following proposition:

Proposition 4.2 The set $\mathcal{T}_m$ is an existence set for $m \geq 1$.

Proof. Let f be a prescribed element of C[U], and let $a_0$ be an arbitrary point of $\mathcal{T}_m$. We
are looking for the closest point to f in $\mathcal{T}_m$. It has to lie in the set

$$\{ a \in \mathcal{T}_m \;:\; \|a - f\| \leq \|a_0 - f\| \}.$$

This set is clearly closed and bounded, and since $\mathcal{T}_m$ is finite-dimensional it is compact
by Theorem A.1. The best approximation property then follows from Theorem 3.1. □

From this proposition we can see that whenever the approximating function is a
finite linear combination of basis functions, the set that is spanned by these basis functions
is an existence set for C[U]. Depending on the norm that is chosen in C[U] the best
approximating element can be unique. In fact the following proposition holds (see Appendix A for the
definition of strictly convex):

Proposition 4.3 The set $\mathcal{T}_m$, $m \geq 1$, is a Tchebycheff set if the normed space C[U] is
strictly convex.

Proof. The existence has already been proved. Suppose then that there are two best
approximating elements f and f' from $\mathcal{T}_m$ to a function $g \in C[U]$. Let $\lambda$ be the distance of g
from $\mathcal{T}_m$. Applying the triangle inequality we obtain:

$$\Big\| \tfrac{1}{2}(f + f') - g \Big\| \leq \tfrac{1}{2}\|f - g\| + \tfrac{1}{2}\|f' - g\| = \lambda \tag{7}$$

Since $\mathcal{T}_m$ is a vector space, $\tfrac{1}{2}(f + f') \in \mathcal{T}_m$ and by definition of $\lambda$ it follows that
$\|\tfrac{1}{2}(f + f') - g\| \geq \lambda$. This implies that equality holds in equation 7. If $\lambda = 0$ it is clear that
f = f' = g. If $\lambda \neq 0$, then we can write equation 7 as

$$\Big\| \frac{1}{2} \left( \frac{f - g}{\lambda} + \frac{f' - g}{\lambda} \right) \Big\| = 1. \tag{8}$$

This means that the vectors $(f - g)/\lambda$ and $(f' - g)/\lambda$ and their midpoint are all of norm 1, but since
strict convexity holds, then f = f'. □

Since it is well known that C[U] with the $L^p$-norms, $1 < p < \infty$, is strictly convex (Rice,
1964), we have then shown that in most cases regularization theory gives an approximating
set with the best approximation property and with a unique best approximating element.
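
For the discrete $L^2$ norm the unique best approximation from the span of fixed basis functions can be computed directly. The following sketch is an illustration only: it assumes one-dimensional Gaussian basis functions with hypothetical fixed centers and width, samples the norm on a finite grid, and solves the normal equations with a small hand-written elimination routine. None of these choices come from the paper.

```python
import math

def gaussian(x, center, width=0.5):
    return math.exp(-((x - center) ** 2) / width ** 2)

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n + 1):
                a[r][k] -= factor * a[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(a[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (a[r][n] - s) / a[r][r]
    return x

def best_coefficients(target, centers, grid):
    """Least-squares coefficients c_i from the normal equations (B^T B) c = B^T f."""
    basis = [[gaussian(x, t) for t in centers] for x in grid]
    m = len(centers)
    gram = [[sum(row[i] * row[j] for row in basis) for j in range(m)] for i in range(m)]
    rhs = [sum(row[i] * target(x) for row, x in zip(basis, grid)) for i in range(m)]
    return solve(gram, rhs)

def residual(target, centers, coeffs, grid):
    """Discrete L2 distance between the target and the expansion sum_i c_i phi_i."""
    return math.sqrt(sum(
        (target(x) - sum(c * gaussian(x, t) for c, t in zip(coeffs, centers))) ** 2
        for x in grid))

grid = [k / 50.0 for k in range(51)]      # sample points in [0, 1]
centers = [0.0, 0.5, 1.0]                 # fixed centers (hypothetical)
target = lambda x: math.sin(2.0 * math.pi * x)
c_best = best_coefficients(target, centers, grid)
d_best = residual(target, centers, c_best, grid)
```

Any perturbation of `c_best` increases the residual, which reflects both existence and uniqueness for this strictly convex norm. In the uniform norm the same span is still an existence set, but uniqueness is no longer guaranteed by this argument.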

^2 Notice that multilayer perceptrons of the type used for backpropagation cannot be derived from any
regularization scheme since they cannot be written as a linear superposition of Green's functions of any kind.
5 Conclusions

5.1 GRBF and Best Approximation 

We have recently extended the scheme of equation 5 to the case in which the number of basis
functions is less than the number of data points (Poggio and Girosi, 1989; Broomhead and
Lowe, 1988). The reason for this is that when the number of data points becomes large the
complexity of the network may become too high, being proportional to the number of data
points. A solution to the approximation problem is sought of the form:

$$f(\mathbf{x}) = \sum_{i=1}^{n} c_i\, G(\mathbf{x}; \mathbf{t}_i) \tag{9}$$

where n is smaller than the number of data points and the positions of the "centers" $\mathbf{t}_i$
of the expansion are unknown, having to be found during the learning stage. Does the
best approximation property hold for this approximation scheme, which we call the Generalized
Radial Basis Function (GRBF) method? The answer is no, exactly as for splines with free
knots, to which equation 9 is in fact equivalent. By the same arguments we have used in
section 4.1 we could show that the set $\mathcal{G}_n$ of approximating functions generated by equation
9 (the analogue of the set $\mathcal{T}_m$) is not closed. The scheme, however, has almost the best
approximation property in the following sense. The scheme already works satisfactorily if
the centers $\mathbf{t}_i$ are fixed to a subset of examples or other positions. In this case $\mathcal{G}_n$ is a
linear space, and it is an existence set, as is $\mathcal{T}_m$. We could then have an algorithm
in which first the centers are found independently (for instance by the K-means algorithm,
see Moody and Darken, 1989) and then the $c_i$ are obtained with gradient descent methods
(see Poggio and Girosi, 1989). In this scheme the best approximation property is preserved,
while the computational complexity has been reduced with respect to the exact solution of
the regularization problem.
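
The two-stage scheme can be sketched as follows. This is a minimal illustration under several assumptions that are not in the paper: the data are one-dimensional, the Green's function is taken to be a Gaussian of hypothetical fixed width, the centers come from plain Lloyd iterations of K-means, and the gradient descent step size is chosen conservatively. The names `kmeans_1d`, `design_matrix` and `fit_coefficients` are hypothetical.

```python
import math

def kmeans_1d(points, k, iterations=20):
    """Plain Lloyd iterations in one dimension, started from evenly spaced data points."""
    ordered = sorted(points)
    centers = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

def design_matrix(xs, centers, width):
    """Matrix with entries G(x_j; t_i) for a Gaussian G (an assumed choice)."""
    return [[math.exp(-((x - t) ** 2) / width ** 2) for t in centers] for x in xs]

def fit_coefficients(basis, ys, steps=2000):
    """Gradient descent on E(c) = 1/2 * sum_j (sum_i c_i G_ji - y_j)^2, centers frozen."""
    n = len(basis[0])
    # A step of 1 / trace(B^T B) is below 2 / lambda_max, so E never increases.
    rate = 1.0 / sum(v * v for row in basis for v in row)
    coeffs = [0.0] * n
    history = []
    for _ in range(steps):
        errors = [sum(c * v for c, v in zip(coeffs, row)) - y
                  for row, y in zip(basis, ys)]
        history.append(0.5 * sum(e * e for e in errors))
        grad = [sum(e * row[i] for e, row in zip(errors, basis)) for i in range(n)]
        coeffs = [c - rate * g for c, g in zip(coeffs, grad)]
    return coeffs, history

xs = [k / 40.0 for k in range(41)]
ys = [math.sin(2.0 * math.pi * x) for x in xs]
centers = kmeans_1d(xs, k=5)                       # stage 1: freeze the centers
basis = design_matrix(xs, centers, width=0.2)
coeffs, history = fit_coefficients(basis, ys)      # stage 2: linear coefficients only
```

Once the centers are frozen the second stage is a convex problem in the coefficients, which is why a best approximation exists for it.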

There are other ways to give GRBF the best approximation property. The most interesting
approach is to follow the theory of $\gamma$-polynomials (Braess, 1986; Rice, 1964, 1969; Hobby and
Rice, 1967; De Boor, 1969) and complete the set of basis functions with its closure,
consisting of an appropriate number of derivatives of the Green's function with respect to its
parameters, yielding a best approximation scheme. It seems very difficult to use either of
these two approaches for networks of the type used for backpropagation.

5.2 Open Questions 

We have not explored the practical consequences of the fact that multilayer networks of the
backpropagation type do not have the best approximation property. Intuitively, it seems that the lack of the
best approximation property is related to possible practical degeneracies of the solution. In
certain situations, because the sigmoid, which is asymptotically constant,
contains as an argument one set of parameters (the $\mathbf{w}_i$), the precise values of these parameters
may not have any significant effect on the output of the network. The same situation
arises for GRBF when the centers inside the Green's function are unknown. In the GRBF
case, however, we can freeze the $\mathbf{t}_i$ to reasonable values whereas this is impossible in the
backpropagation case.




Other questions remain open as well. The most important questions from the viewpoint 
of approximation theory are: (1) the computation of the best approximation, i.e., which 
algorithm to use, (2) a priori bounds on the goodness of the approximation given some 
generic information on the class of functions to be approximated, and (3) a priori estimates 
of the complexity of the best approximation, again given generic information on the class of 
functions to be approximated. In the case of RBF, the latter question is directly related to 
the size of the required training set, and therefore to the deep issue of sample complexity 
(see Poggio and Girosi, 1989, section 9.3). Regarding questions (1) and (2), notice that in practical
cases it may be admissible to use a scheme that does not have the best approximation property, if it provides
an almost as good approximation at a much lower computational cost.
A Definitions and basic theorems

We review here some of the definitions that have been used in the paper. Every set will be
assumed to have the structure of a metric space, unless otherwise specified, and the concepts
of limit point, infimum and supremum are assumed to be known. All these definitions and
theorems can be found in any standard text on functional analysis (Yosida, 1974; Rudin,
1973) and in many books on approximation theory (Braess, 1986; Cheney, 1981).
An important concept is that of closure: 

Definition A.1 If S is a set of elements, then by the closure [S] of S we mean the set of
all points in S together with the set of all limit points of S.

We can now define the closed sets as follows:

Definition A.2 A set S is closed if it coincides with its closure [S].

A closed set then contains all its limit points. Another important definition related to the
concept of closure is that of dense sets:

Definition A.3 Let T be a subset of the set S. T is dense in S if [T] = S.

If T is dense in S then each element of S can be approximated arbitrarily well by elements
of T. As an example we mention the set of rational numbers, which is dense in the set of real
numbers, and the set of polynomials, which is dense in the space of continuous functions (see
Appendix B).

In order to extend some properties of the real valued functions defined on an interval to 
real valued functions defined on more complex metric spaces it is fundamental to define the 
compact sets: 

Definition A.4 A compact set is one in which every infinite subset contains at least one
limit point.

It can be shown that, in finite dimensional metric spaces, there exists a simple characterization
of compact sets. In fact the following theorem holds:

Theorem A.1 Every closed, bounded, finite-dimensional set in a metric linear space is
compact.

The well known Weierstrass theorem on the attainment of the extrema of a continuous
function on an interval can now be extended as follows:

Theorem A.2 A continuous real valued function defined on a compact set in a metric space
achieves its infimum and supremum on that set.

A subset of the metric spaces is given by the normed spaces, and among the normed spaces, 
a special role is played by the strictly convex spaces: 

Definition A.5 A normed space is strictly convex if:

$$\|f\| = \|g\| = \Big\|\tfrac{1}{2}(f + g)\Big\| = 1 \;\Rightarrow\; f = g$$

The geometrical interpretation of this definition is that a space is strictly convex if the unit 
sphere does not contain any line segment on its surface. 
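
A finite-dimensional analogue makes the geometric picture concrete. The sketch below works in $R^2$ rather than in a function space, which is an assumption made purely for illustration, and the names `p_norm` and `max_norm` are hypothetical. For the 1-norm and the maximum norm two distinct unit vectors can have a midpoint of norm 1, so those norms are not strictly convex, while for the 2-norm the midpoint falls strictly inside the unit ball.

```python
def p_norm(v, p):
    """Norm (sum_k |v_k|^p)^(1/p) of a finite vector."""
    return sum(abs(t) ** p for t in v) ** (1.0 / p)

def max_norm(v):
    return max(abs(t) for t in v)

def midpoint(a, b):
    return tuple(0.5 * (s + t) for s, t in zip(a, b))

# Two distinct vectors of unit 1-norm and unit 2-norm.
f = (1.0, 0.0)
g = (0.0, 1.0)
mid_l1 = p_norm(midpoint(f, g), 1)    # the midpoint stays on the unit sphere
mid_l2 = p_norm(midpoint(f, g), 2)    # the midpoint is strictly inside the unit ball

# Two distinct vectors of unit maximum norm whose midpoint also has norm 1.
h = (1.0, 1.0)
k = (1.0, -1.0)
mid_max = max_norm(midpoint(h, k))
```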

B Gaussian networks and Stone's theorem 

It has been proved (Cybenko, 1989; Funahashi, 1989) that a network with one hidden
layer of sigmoidal units can approximate a continuous function arbitrarily well. Here we
show that this property, which is well known for algebraic and trigonometric polynomial
approximation schemes, is shared by a network with Gaussian hidden units. The proof is
a simple application of the Stone-Weierstrass theorem, which is the generalization given
by Stone of the Weierstrass approximation theorem (Stone, 1937, 1948). Our result was
obtained independently from the equivalent proof of Hartman, Keeler and Kowalski (1989).
We first need the definition of an algebra.

Definition B.1 An algebra is a set of elements denoted by $\mathcal{Y}$, together with a scalar field
$\mathcal{F}$, which is closed under the binary operations of + (addition between elements of $\mathcal{Y}$), ×
(multiplication of elements of $\mathcal{Y}$), and · (multiplication of elements in $\mathcal{Y}$ by elements from the
scalar field $\mathcal{F}$), such that

1. $\mathcal{Y}$ together with $\mathcal{F}$, + and · forms a linear space,

2. if f, g, h are in $\mathcal{Y}$, and $\alpha$ is in $\mathcal{F}$, then
f × g is in $\mathcal{Y}$,

$$f \times (g \times h) = (f \times g) \times h,$$
$$f \times (g + h) = f \times g + f \times h,$$
$$(f + g) \times h = f \times h + g \times h,$$
$$\alpha (f \times g) = (\alpha f) \times g = f \times (\alpha g).$$



It is an elementary calculation to show that if U is some subset of $R^d$ then C[U] is an
algebra with respect to the scalar field R. We can now define a subalgebra as follows:

Definition B.2 A set S is a subalgebra of the algebra $\mathcal{Y}$ if

1. S is a linear subspace of $\mathcal{Y}$,

2. S is closed under the operation ×. That is, if f and g are in S, then f × g is also
in S.

We can now formulate Stone's theorem:

Theorem B.l (Stone, 1937) Let X be a compact metric space, C[X] the set of continuous 
functions defined on X, and A a subalgebra of C[X] with the following two properties: 

1. the function f(x) = 1 belongs to A; 

2. for any two distinct points x and y in X there is a function $f \in A$ such that $f(x) \neq f(y)$.

Then A is dense in C[X]. 

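As a simple application of this theorem we consider the set of Gaussian superpositions, defined
as

$$\mathcal{G}_X = \Big\{ f \in C[X] \;\Big|\; f(\mathbf{x}) = \sum_{i=1}^{m} c_i\, e^{-\frac{(\mathbf{x} - \mathbf{t}_i)^2}{\sigma_i^2}},\ X \subseteq R^d,\ \mathbf{t}_i \in R^d,\ c_i, \sigma_i \in R,\ m \in N \Big\}$$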



We can now state the following:

Proposition B.1 The set $\mathcal{G}_X$ is dense in C[X], where X is a compact subset of $R^d$.






Proof: In order to use Stone's theorem, we first have to show that $\mathcal{G}_X$ is a subalgebra
of C[X], for each compact subset X of $R^d$. The set $\mathcal{G}_X$ will be a subalgebra of C[X] if the
product of two of its elements yields another element of $\mathcal{G}_X$. Since the elements of $\mathcal{G}_X$ are linear
superpositions of Gaussians of different variances centered on different points, it is sufficient to deal with
the product of two Gaussians. From the identity below it follows that the product of two
Gaussians centered on two points $\mathbf{t}_1$ and $\mathbf{t}_2$ is proportional to a Gaussian centered on a point
$\mathbf{t}_3$ that is a convex linear combination of $\mathbf{t}_1$ and $\mathbf{t}_2$. In fact we have:

$$e^{-\frac{(\mathbf{x} - \mathbf{t}_1)^2}{\sigma_1^2}} \cdot e^{-\frac{(\mathbf{x} - \mathbf{t}_2)^2}{\sigma_2^2}} = c\, e^{-\frac{(\mathbf{x} - \mathbf{t}_3)^2}{\sigma_3^2}},$$

$$\mathbf{t}_3 = \frac{\sigma_2^2\, \mathbf{t}_1 + \sigma_1^2\, \mathbf{t}_2}{\sigma_1^2 + \sigma_2^2}, \qquad \sigma_3^2 = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}, \qquad c = e^{-\frac{(\mathbf{t}_1 - \mathbf{t}_2)^2}{\sigma_1^2 + \sigma_2^2}}.$$

The function f(x) = 1 belongs to $\mathcal{G}_X$, since it can be considered as a Gaussian of infinite
variance, and for any distinct points x, y we can obviously find a function in $\mathcal{G}_X$ such that
$f(\mathbf{x}) \neq f(\mathbf{y})$: the conditions of Stone's theorem are then satisfied and $\mathcal{G}_X$ is dense in C[X]. □
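
The product identity used in the proof can be verified numerically. The sketch below assumes the one-dimensional case and Gaussians of the form exp(−(x − t)²/s²), so that the width parameter enters squared; the function names and the test values are hypothetical.

```python
import math

def gauss(x, t, s):
    """exp(-(x - t)^2 / s^2): a Gaussian centered at t with width parameter s."""
    return math.exp(-((x - t) ** 2) / s ** 2)

def product_parameters(t1, s1, t2, s2):
    """Center, width and prefactor of the product of two Gaussians."""
    denom = s1 ** 2 + s2 ** 2
    t3 = (s2 ** 2 * t1 + s1 ** 2 * t2) / denom     # convex combination of t1 and t2
    s3 = math.sqrt(s1 ** 2 * s2 ** 2 / denom)
    c = math.exp(-((t1 - t2) ** 2) / denom)
    return t3, s3, c

t1, s1, t2, s2 = -0.3, 0.8, 1.1, 0.5               # arbitrary test values
t3, s3, c = product_parameters(t1, s1, t2, s2)

# Largest difference between the product and the single rescaled Gaussian on a grid.
xs = [-3.0 + 0.05 * k for k in range(121)]
max_gap = max(abs(gauss(x, t1, s1) * gauss(x, t2, s2) - c * gauss(x, t3, s3))
              for x in xs)
```

The value of `max_gap` is at the level of floating point rounding, and `t3` lies between the two original centers, as the convex combination requires.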



C Regularization networks can approximate smooth 
functions arbitrarily well 

In this appendix we briefly describe the regularization method for approximating functions
and show that the networks that are derived from a regularization principle can approximate
arbitrarily well continuous functions defined on a compact subset of $R^n$.

Let $S = \{(\mathbf{x}_i, y_i) \in R^n \times R \mid i = 1, \ldots, N\}$ be a set of data that we want to approximate by
means of a function f. The regularization approach (Tikhonov, 1963; Tikhonov and Arsenin,
1977; Morozov, 1984; Bertero, 1986) consists in computing the function f that minimizes
the functional

$$H[f] = \sum_{i=1}^{N} (y_i - f(\mathbf{x}_i))^2 + \lambda \|Pf\|^2$$

where P is a constraint operator (usually a differential operator), $\|\cdot\|$ is a norm on the
function space to which Pf belongs (usually the $L^2$ norm) and $\lambda$ is a positive real number,
the so called regularization parameter. The structure of the operator P embodies the a
priori knowledge about the solution, and therefore depends on the nature of the particular
problem that has to be solved. The general form of the solution of this variational problem
is given by the following expansion (Poggio and Girosi, 1989):

$$f(\mathbf{x}) = \sum_{i=1}^{N} c_i\, G(\mathbf{x}; \mathbf{x}_i) + p(\mathbf{x}) \tag{11}$$

where G is the Green's function of the differential operator $\hat{P}P$, $\hat{P}$ being the adjoint operator
of P, p(x) is a linear combination of functions that span the null space of P, and the
coefficients $c_i$ can be found by inverting a matrix that depends on the data points (Poggio
and Girosi, 1989). We remind the reader that the Green's function of an operator $\hat{P}P$ is the
function that satisfies the following differential equation (in the sense of distributions):

$$\hat{P}P\, G(\mathbf{x}; \mathbf{y}) = \delta(\mathbf{x} - \mathbf{y}). \tag{12}$$

It is clear that there is a correspondence between the class of functions that can be written
in the form (11) (for any number of data points and for any Green's function G of a
self-adjoint operator) and a subclass of feedforward networks with one layer of hidden units, of
the type shown in figure 1. Under mild assumptions on $\hat{P}P$, these networks can approximate
continuous functions arbitrarily well, as is stated in the following proposition:

Proposition C.1 For every continuous function F defined on a compact subset of $R^n$,
every piecewise continuous G which is the Green's function of a self-adjoint differential
operator, and any positive $\epsilon$, there exists a function $f^*(\mathbf{x}) = \sum_{i=1}^{m} c_i\, G(\mathbf{x}; \mathbf{x}_i)$ such that for all x
the following inequality holds:

$$|F(\mathbf{x}) - f^*(\mathbf{x})| < \epsilon$$



Proof: Let F be a continuous function defined on a compact set $D \subset R^n$. Its domain of
definition can be extended to all of $R^n$ by assigning zero value to all points that do not belong to
D. The resulting function, which we still call F, is a continuous function with bounded support.^3
Consider the space K of "test functions" (Gelfand and Shilov, 1964), which consists of real
functions $\phi(\mathbf{x})$ with continuous derivatives of all orders and with bounded support (which
means that the function and all its derivatives vanish outside of some bounded region). As
Gelfand and Shilov show (Appendix 1.1), there always exists a function $\phi(\mathbf{x})$ in K arbitrarily
close to F, i.e., such that for all x and for any $\epsilon > 0$,

$$|F(\mathbf{x}) - \phi(\mathbf{x})| < \epsilon.$$

Thus it is sufficient to show that every function $\phi(\mathbf{x}) \in K$ can be approximated arbitrarily
well by a linear superposition of Green's functions (function $f^*$ of Proposition C.1).
We start with the identity

$$\phi(\mathbf{x}) = \int d\mathbf{y}\, \phi(\mathbf{y})\, \delta(\mathbf{x} - \mathbf{y}) \tag{13}$$

where the integral is actually taken only over the bounded region in which $\phi(\mathbf{x})$ fails to
vanish. By means of equation 12 we obtain

$$\phi(\mathbf{x}) = \int d\mathbf{y}\, \phi(\mathbf{y})\, (\hat{P}P\, G)(\mathbf{x}; \mathbf{y}) \tag{14}$$

and since $\phi(\mathbf{x})$ is in K and $\hat{P}P$ is formally self-adjoint we have

$$\phi(\mathbf{x}) = \int d\mathbf{y}\, G(\mathbf{x}; \mathbf{y})\, (\hat{P}P\, \phi)(\mathbf{y}). \tag{15}$$
^3 The support of a continuous function F(x) is the closure of the set on which $F(\mathbf{x}) \neq 0$.
We can rewrite equation 15 as

$$\phi(\mathbf{x}) = \int d\mathbf{y}\, G(\mathbf{x}; \mathbf{y})\, \psi(\mathbf{y}) \tag{16}$$

where $\psi(\mathbf{x}) = \hat{P}P\, \phi(\mathbf{x})$. Since $G(\mathbf{x}; \mathbf{y})\, \psi(\mathbf{y})$ is piecewise continuous on a closed domain, this
integral exists in the sense of Riemann. By definition of the Riemann integral, equation 16 can
then be written as

$$\phi(\mathbf{x}) = \Delta^n \sum_{k \in I} \psi(\mathbf{x}_k)\, G(\mathbf{x}; \mathbf{x}_k) + E_{\mathbf{x}}(\Delta) \tag{17}$$

where the $\mathbf{x}_k$ are points of a square grid of spacing $\Delta$, I is the finite set of lattice points where
$\psi(\mathbf{x}) \neq 0$, and $E_{\mathbf{x}}(\Delta)$ is the discretization error, with the property

$$\lim_{\Delta \to 0} E_{\mathbf{x}}(\Delta) = 0. \tag{18}$$

If we now choose $f^*(\mathbf{x}) = \Delta^n \sum_{k \in I} \psi(\mathbf{x}_k)\, G(\mathbf{x}; \mathbf{x}_k)$, combining equation 18 and equation 17
we obtain

$$\lim_{\Delta \to 0} [\phi(\mathbf{x}) - f^*(\mathbf{x})] = 0. \tag{19}$$

Thus every function $\phi \in K$ can be approximated arbitrarily well by a linear superposition
of Green's functions G of a self-adjoint operator, and this concludes the proof. □
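
The discretization step of the proof can be tried out in one dimension. The sketch below rests on assumptions that are not in the paper: the operator is taken to be d²/dx², whose Green's function is |x − y|/2 (the operator built from P = d/dx in the text may differ from this by a sign, which cancels in the product of G and ψ); the function φ is a polynomial bump with bounded support rather than a true infinitely differentiable test function; and the grid uses cell midpoints. All function names are hypothetical.

```python
def phi(x):
    """Bump (1 - x^2)^4 supported on [-1, 1]; a stand-in for a test function."""
    return (1.0 - x * x) ** 4 if abs(x) < 1.0 else 0.0

def psi(x):
    """Second derivative of phi, playing the role of psi in the integral representation."""
    if abs(x) >= 1.0:
        return 0.0
    u = 1.0 - x * x
    return -8.0 * u ** 3 + 48.0 * x * x * u ** 2

def green(x, y):
    """Green's function of d^2/dx^2 in one dimension (assumed operator)."""
    return 0.5 * abs(x - y)

def superposition(x, spacing):
    """f*(x) = spacing * sum_k psi(x_k) G(x; x_k) over grid points inside the support."""
    count = int(round(2.0 / spacing))
    nodes = [-1.0 + (k + 0.5) * spacing for k in range(count)]
    return spacing * sum(psi(y) * green(x, y) for y in nodes)

def sup_error(spacing):
    """Largest deviation between phi and its Green's function superposition on [-1.5, 1.5]."""
    probes = [-1.5 + 0.05 * k for k in range(61)]
    return max(abs(phi(x) - superposition(x, spacing)) for x in probes)

errors = [sup_error(h) for h in (0.2, 0.1, 0.05, 0.025)]
```

The entries of `errors` decrease as the grid spacing shrinks, in line with the vanishing discretization error, and each `superposition` is a finite linear combination of translates of the Green's function, that is, a network with one hidden unit per grid point.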

Remark: The conditions of Proposition C.1 exclude Green's functions that have
singularities at the origin. An example is the Green's function associated with the "membrane"
stabilizer $P = \nabla$ in 2 or more dimensions. In 2 dimensions, the membrane Green's function
is $G(r) = -\log r$, where $r = \|\mathbf{x} - \mathbf{x}_i\|$ (in 1 dimension $G(x) = |x|$, which satisfies the conditions of
Proposition C.1).

Remark: Notice that in order to approximate arbitrarily well any continuous function on
a compact domain with functions of the type of equation 11, it is not necessary to include the term p
belonging to the null space of P.